Block Heavy Hitters

Authors

  • Alexandr Andoni
  • Khanh Do Ba
  • Piotr Indyk
Abstract

We study a natural generalization of the heavy hitters problem in the streaming context. We term this generalization block heavy hitters and define it as follows. We are to stream over a matrix $A$ and report all rows that are heavy, where a row is heavy if its $\ell_1$ norm is at least a $\phi$ fraction of the $\ell_1$ norm of the entire matrix $A$. In comparison, in the standard heavy hitters problem, we are required to report the matrix entries that are heavy. As is common in streaming, we solve the problem approximately: we return all rows with weight at least $\phi$, but also possibly some other rows that have weight no less than $(1-\epsilon)\phi$. To solve the block heavy hitters problem, we show how to construct a linear sketch of $A$ from which we can recover the heavy rows of $A$.

The block heavy hitters problem has already found applications for other streaming problems. In particular, it is a crucial building block in a streaming algorithm of [AIK08] that constructs a small-size sketch for the Ulam metric, a metric on non-repetitive strings under the edit (Levenshtein) distance.

We prove the following theorem. Let $M_{n,m}$ be the set of real matrices $A$ of size $n$ by $m$, with entries from $E = \frac{1}{nm} \cdot \{0, 1, \ldots, nm\}$. For a matrix $A$, let $A_i$ denote its $i$-th row.

Theorem 0.1. Fix some $\epsilon > 0$, $n, m \ge 1$, and $\phi \in [0, 1]$. There exists a randomized linear map (sketch) $\mu : M_{n,m} \to \{0, 1\}^s$, where $s = O\left(\frac{1}{\epsilon^5 \phi^2} \log n\right)$, such that the following holds. For a matrix $A \in M_{n,m}$, it is possible, given $\mu(A)$, to find a set $W \subset [n]$ of rows such that, with probability at least $1 - 1/n$, we have:

  • for any $i \in W$, $\frac{\|A_i\|_1}{\|A\|_1} \ge (1-\epsilon)\phi$, and
  • if $\frac{\|A_i\|_1}{\|A\|_1} \ge \phi$, then $i \in W$.

Moreover, $\mu$ can be of the form $\mu(A) = \mu'(\rho(A_1), \rho(A_2), \ldots, \rho(A_n))$, where $\rho : E^m \to \mathbb{R}^k$ and $\mu' : \mathbb{R}^{kn} \to \{0, 1\}^s$ are randomized linear mappings. That is, the sketch $\mu$ is obtained by first sketching the rows of $A$ (using the same function $\rho$) and then sketching those sketches.

Our construction is inspired by the CountMin sketch of [CM05], and may be seen as a CountMin sketch on the projections of the rows of $A$.

Proof. Construction of the sketch. We define the function $\rho$ as an $\ell_1$ projection into a space with $k = O\left(\frac{1}{\epsilon^2} \log n\right)$ dimensions, achieved through a standard Cauchy distribution projection.
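The idea of "a CountMin sketch on the projections of the rows" can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the construction from the proof: it keeps real-valued buckets rather than the bit-valued sketch of the theorem, uses fully random bucket assignments in place of hash functions, estimates $\ell_1$ norms by the median of absolute Cauchy projections, and reports a row when its estimate reaches $(1 - \epsilon/2)\phi$ times the estimated total (a threshold chosen here for illustration). All function names and the parameters `k`, `depth`, and `width` are hypothetical.

```python
import math
import random
from statistics import median

def exact_heavy_rows(A, phi):
    """Reference answer: rows whose l1 norm is at least phi * ||A||_1."""
    total = sum(abs(x) for row in A for x in row)
    return [i for i, row in enumerate(A)
            if sum(abs(x) for x in row) >= phi * total]

def cauchy_matrix(k, m, rng):
    """k x m matrix of i.i.d. standard Cauchy entries (plays the role of rho)."""
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(m)]
            for _ in range(k)]

def project(C, row):
    """Linear map rho: project one row of length m down to k coordinates."""
    return [sum(c * x for c, x in zip(c_row, row)) for c_row in C]

def l1_estimate(sketch):
    """Median of |coordinates| estimates the l1 norm of the projected vector."""
    return median(abs(v) for v in sketch)

def build_sketch(A, k, depth, width, rng):
    """CountMin-style table whose buckets hold sums of row projections."""
    n, m = len(A), len(A[0])
    C = cauchy_matrix(k, m, rng)
    # Assumption: fully random bucket assignment stands in for hash functions.
    hashes = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
    table = [[[0.0] * k for _ in range(width)] for _ in range(depth)]
    for i, row in enumerate(A):
        r = project(C, row)
        for t in range(depth):
            bucket = table[t][hashes[t][i]]
            for j in range(k):
                bucket[j] += r[j]
    return hashes, table

def recover_heavy_rows(hashes, table, n, phi, eps):
    """Report rows whose minimum bucket estimate clears the threshold."""
    k = len(table[0][0])
    # The buckets of one table row sum to the projection of the sum of all
    # rows; for nonnegative entries its l1 norm equals ||A||_1.
    total = l1_estimate([sum(b[j] for b in table[0]) for j in range(k)])
    threshold = (1 - eps / 2) * phi * total  # illustrative choice
    heavy = []
    for i in range(n):
        est = min(l1_estimate(table[t][hashes[t][i]])
                  for t in range(len(table)))
        if est >= threshold:
            heavy.append(i)
    return heavy

rng = random.Random(0)
n, m = 20, 30
# Entries are multiples of 1/(nm) in [0, 1]; row 3 is made heavy on purpose.
A = [[rng.randint(0, 30) / (n * m) for _ in range(m)] for _ in range(n)]
A[3] = [1.0] * m
hashes, table = build_sketch(A, k=101, depth=4, width=8, rng=rng)
print("exact:   ", exact_heavy_rows(A, phi=0.3))
print("sketched:", recover_heavy_rows(hashes, table, n, phi=0.3, eps=0.5))
```

Because every step (projection, bucket accumulation) is linear in $A$, the table can be updated entry by entry as the matrix is streamed. The bucket estimates rely on the entries being nonnegative, so that the $\ell_1$ norm of a sum of rows equals the sum of their $\ell_1$ norms.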

Similar resources

Comparison between multistage filters and sketches for finding heavy hitters

The purpose of this write-up is to compare multistage filters [3] and sketches with respect to their ability to identify heavy hitters. In a nutshell, the conclusion is that multistage filters as I use them identify heavy hitters with less memory than sketches, but some sketches support important other operations, more specifically they can be added and subtracted without any need to re-read th...

New Algorithms for Heavy Hitters in Data Streams

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the $\ell_1$-heavy hitters and $\ell_2$-heavy hitters. There are a number of algorithmic ...

New Algorithms for Heavy Hitters in Data Streams (Invited Talk)

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the $\ell_1$-heavy hitters and $\ell_2$-heavy hitters. There are a number of algorithmic ...

Using 2D Hierarchical Heavy Hitters to Investigate Binary Relationships

This chapter presents VHHH: a visual data mining tool to compute and investigate hierarchical heavy hitters (HHHs) for two-dimensional data. VHHH computes the HHHs for a two-dimensional categorical dataset and a given threshold, and visualizes the HHHs in the three dimensional space. The chapter evaluates VHHH on synthetic and real world data, provides an interpretation alphabet, and identifies...

An Optimal Algorithm for $\ell_1$-Heavy Hitters in Insertion Streams and Related Problems

We give the first optimal bounds for returning the $\ell_1$-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of $m$ items in $\{1, 2, \ldots, n\}$ and parameters $0 < \varepsilon < \varphi \le 1$, let $f_i$ denote the frequency of item $i$, i.e., the number of times item $i$ occurs in the stream. With arbitrarily large constant probab...

Publication date: 2008